ABSTRACT


We show that the leaving-one-out method of pattern recognition
must yield biased results when the two sets of training data
(representing two classes to be discriminated) are identical.
This phenomenon, which we observed during a study of the
sensitivity of classification results to errors in the training
data, can be eliminated by generating the training sets
independently.




1.  Introduction


In certain studies of the sensitivity of classification
results to errors in the data submitted for classification,
one submits for training two sets of data, the second of which
has been derived from the first by corrupting it (adding noise
or bias). An example of such a study arises in the assessment
of the effect of measurement noise on the accuracy of radar
static patterns from reentry vehicles. A procedure for assessing
this effect is to add noise of increasing magnitude to a static
pattern defined to be free of noise and then submit signatures
derived from the noisy and noise-free patterns to a
classification algorithm. So long as the "noisy" signatures
cannot be discriminated (according to some probability of error
criterion) from the noise-free signatures, the noise in the
static patterns is accepted. Noise levels which lead to
signatures which can be discriminated from those corresponding
to a noise-free pattern are not accepted. The leaving-one-out
(L-) method of classification has properties which make it
suitable for these and other investigations: it makes efficient
use of the data, which is desirable when few samples are
available, and its expected error is an upper bound on the Bayes
classification error. (For details, see [1], Chap. 6.) It is the
purpose of this note to justify the following recommendation:
if the L-method is used to perform discrimination in such
studies, then the corrupted data should be generated not from
the uncorrupted data, but from a set of data which is
statistically (but not identically) equivalent to it. The
reason for this recommendation is the fact (to be demonstrated
below) that the L-method produces strongly biased results when
it is exercised upon data sets which are identical, or upon
those with the property that one set is a (slight) corruption
of the other.

2.  The biasness of the L-method when classes are represented
by identical data sets
The plan of this section is to show first that, when the
two classes described above have a one-dimensional distribution,
the Gaussian density estimate for the class from which a sample
has been left out is smaller than the Gaussian density estimate
for the class which includes the sample. (We shall exclude
throughout the trivial case in which all observations in the
full sample are identical.) Since reasonable classifiers are
based upon the comparison (for example, through (log-)
likelihood ratios) of density estimators evaluated at points in
the test set, the biasness of the L-method follows in this case.
Next we show, using the previous result, that the L-method is
biased in these circumstances for classes from a
multi-dimensional distribution when the multivariate Gaussian
density estimator is used. Finally, we motivate the result for
the case of a general density estimate by proving it true for
the Parzen estimator. We expect that a similar argument can be
found to prove L-method biasness for identical data sets in the
case of other reasonable density estimators, but we shall not
do that here.

A.  Classes  drawn  from  a  one-dimensional  distribution 

Let $x_1, \ldots, x_{N-1}, x_N$ be the variables in the class
from which $x_N$ is left out during the training (estimation)
part of classifier design. Since the performance of the L-method
is invariant with respect to shifts of the data, we may assume
that

\[ (N-1)^{-1} \sum_{i=1}^{N-1} x_i = 0. \]


There  are  then  three  cases  to  consider: 

Case A1. $x_1 = x_2 = \cdots = x_{N-1}$, $x_N \neq 0$. The
contribution to the Gaussian estimate at $x_N$ is greater than
zero, so that the estimate based on $x_1, \ldots, x_{N-1}, x_N$
is greater than that based on $x_1, \ldots, x_{N-1}$.

Let $s_N^2 = (1/N) \sum_{i=1}^{N} x_i^2$ and assume for the next
two cases, without loss of generality, that $s_{N-1}^2 = 1$.


Case A2. $x_N = 0$. The conclusion follows from the fact that
$s_N < s_{N-1}$, which implies that the N-point density estimate
is larger than the (N-1)-point density estimate.


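Case A3. $x_N \neq 0$. The N-point estimate has mean $x_N/N$
and variance

\[ \sigma_N^2 = s_N^2 - (x_N/N)^2
             = \frac{N-1}{N} + x_N^2 \frac{N-1}{N^2}. \]

The comparable terms in the (N-1)- and N-point density estimates
are

\[ \exp(-x_N^2/2) \quad\text{and}\quad
   \sigma_N^{-1} \exp\!\left[ -\frac{(x_N - x_N/N)^2}{2\sigma_N^2} \right], \]

and we must show that the latter is greater than the former.

There are two cases to consider.

Case A3i. $(N-1)/N + x_N^2 (N-1)/N^2 < 1$. Here
$\sigma_N^{-1} > 1$, so it is enough to show that the exponent
of the latter is smaller in magnitude than that of the former.
This follows from

\[ \frac{(x_N - x_N/N)^2}{\sigma_N^2}
   = x_N^2 \, \frac{N-1}{N + x_N^2} < x_N^2, \]

whence the biasness follows.

Case A3ii. $(N-1)/N + x_N^2 (N-1)/N^2 \geq 1$. Here
$(x_N - x_N/N)^2 / \sigma_N^2 \leq (x_N - x_N/N)^2$, so it is
enough to show that

\[ \left[ \frac{N-1}{N} + x_N^2 \frac{N-1}{N^2} \right]^{-1/2}
   \exp\!\left[ -\frac{x_N^2}{2} \left( \frac{N-1}{N} \right)^2 \right]
   > \exp(-x_N^2/2). \]

This is the same as showing that

\[ \frac{x_N^2}{2} - \frac{x_N^2}{2} \left( \frac{N-1}{N} \right)^2
   - \frac{1}{2} \log\!\left[ \frac{N-1}{N} + x_N^2 \frac{N-1}{N^2} \right] > 0. \]

By the monotonicity of the logarithm, the left side of this
inequality is greater than

\[ \frac{x_N^2}{2} \cdot \frac{2N-1}{N^2}
   - \frac{1}{2} \log\!\left[ 1 + x_N^2 \frac{N-1}{N^2} \right], \]

which by the same monotonicity is greater than

\[ \frac{x_N^2}{2} \cdot \frac{2N-1}{N^2}
   - \frac{1}{2} \log\!\left[ 1 + x_N^2 \frac{2N-1}{N^2} \right]. \]

But this is of the form $\frac{1}{2} [\, y - \log(1 + y) \,]$,
which is positive for $y > 0$, and the biasness follows in this
case as well.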
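
The one-dimensional result can also be checked numerically. The following is a minimal sketch, not part of the original note: it assumes the Gaussian estimate uses the sample mean and the 1/n variance (as in the definition of s_N above), and the helper names `gaussian_estimate` and `left_out_is_penalized` are hypothetical. For randomly drawn samples it compares the N-point estimate at the last point with the (N-1)-point estimate that omits that point.

```python
import math
import random


def gaussian_estimate(sample, x):
    """Gaussian density at x, with mean and 1/n variance fitted to `sample`."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((s - mean) ** 2 for s in sample) / n
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def left_out_is_penalized(sample):
    """True if the N-point estimate at the last point exceeds the (N-1)-point one."""
    x_n = sample[-1]
    return gaussian_estimate(sample, x_n) > gaussian_estimate(sample[:-1], x_n)


# Assumption: standard normal test data; sample sizes N between 3 and 30.
rng = random.Random(0)
trials = [
    [rng.gauss(0.0, 1.0) for _ in range(rng.randint(3, 30))]
    for _ in range(2000)
]
all_penalized = all(left_out_is_penalized(t) for t in trials)
print(all_penalized)
```

The check covers Cases A2 and A3 only; Case A1 is excluded because the (N-1)-point variance would be zero there.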


B.  Classes drawn from a multi-dimensional distribution

Let the estimates of the covariance matrices of the two
design classes be denoted by $\hat\Sigma_i$, $i = 1, 2$. One can
always find a linear transformation of the data which
simultaneously diagonalizes $\hat\Sigma_1$ and $\hat\Sigma_2$
(even if one or both are singular). The Gaussian density
estimator for the transformed data is the product of
one-dimensional Gaussian densities and we can therefore appeal
to the result in the one-dimensional case. Suppose that $X$ is
the M-vector deleted by the L-method and $A$ is the set of
vectors remaining in the corresponding class. Let $\hat P(B)$ be
the Gaussian estimate corresponding to a data set $B$ and
$\hat P^i(B)$ the projection of $\hat P(B)$ onto its i-th
component. Let $Y = (y_1, \ldots, y_M)$ be any vector in $A$ and
put $A^i = \{ y_i : Y \in A \}$. Consider any component $x_i$ of
$X$ and note that if $A^i = \{x_i\}$, then
$\hat P^i(A) = \hat P^i(A \cup \{X\})$, and if
$A^i \neq \{x_i\}$, then $\hat P^i(A) < \hat P^i(A \cup \{X\})$
at $x_i$. Since the data are not identical by assumption, there
exists at least one $i$ such that $A^i \neq \{x_i\}$. Hence, at
least one of the factors of $\hat P(A)$ is smaller than the
corresponding factor of $\hat P(A \cup \{X\})$, whence
$\hat P(A) < \hat P(A \cup \{X\})$ at $X$.

C.  Classes drawn from an arbitrary distribution; the result
motivated by consideration of the Parzen density estimator
In the general case, the underlying density cannot be
summarized by its first- and second-order moments, and it is
not easy to choose a density estimator which is suitable for
all reasonable classification procedures. One estimator which
has earned wide acceptance because of its attractive properties
(asymptotic unbiasedness and consistency) is the Parzen
estimator. For the estimation of one-dimensional densities
by the Parzen technique one usually chooses a kernel function
which has a unique maximum at its central point. The Gaussian
kernel has this property in any number of dimensions. We
shall prove below L-method biasness for data whose densities
are estimated using the Gaussian kernel or any other
multi-dimensional kernel which has a unique maximum at its
central point.
Suppose that the class in question is represented by N-1
data points $x_1, \ldots, x_{N-1}$, plus the point $x_N$, which
is deleted during estimation by the L-method. Let

\[ \hat P_N(x) = (1/N) \sum_{i=1}^{N} p(x_i; x) \]

be the N-point density estimate evaluated at $x$, where
$p(x_i; \cdot)$ is the estimation kernel (cf. [1], Chap. 6). Then

\[ \hat P_N(\cdot) = \frac{N-1}{N} \hat P_{N-1}(\cdot)
                   + \frac{1}{N} p(x_N; \cdot), \]

and at $x_N$

\[ \hat P_N(x_N) > \frac{N-1}{N} \hat P_{N-1}(x_N)
                 + \frac{1}{N} \hat P_{N-1}(x_N)
                 = \hat P_{N-1}(x_N), \]

since $\hat P_{N-1}(x_N) < p(x_N; x_N)$. The last inequality
follows from the fact that $\hat P_{N-1}(x_N)$ is the mean of
N-1 terms $p(x_i; x_N) \leq p(x_N; x_N)$, and at least one of
these terms is strictly less than $p(x_N; x_N)$. Hence, the
Parzen estimate for a set of N data points, evaluated at any one
of them, is greater than that obtained from the (N-1)-point
subset which omits that point, which was to be proved.
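
The Parzen inequality can be illustrated with a short numerical check. This is a minimal sketch under stated assumptions, not the authors' procedure: one-dimensional data, a Gaussian kernel, and the same kernel width `h` for the N-point and (N-1)-point estimates. The function names, the value of `h`, and the uniform test data are hypothetical choices made only for illustration.

```python
import math
import random


def gaussian_kernel(center, x, h):
    """Gaussian kernel of width h centred at `center`, evaluated at x."""
    u = (x - center) / h
    return math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))


def parzen_estimate(sample, x, h):
    """Parzen estimate at x: the mean of the kernels centred at the sample points."""
    return sum(gaussian_kernel(s, x, h) for s in sample) / len(sample)


# Assumptions: 25 points drawn uniformly on [-3, 3]; fixed kernel width h.
rng = random.Random(1)
points = [rng.uniform(-3.0, 3.0) for _ in range(25)]
h = 0.5

# For each point, compare the estimate that includes it with the one that omits it.
gaps = []
for k, x_k in enumerate(points):
    rest = points[:k] + points[k + 1:]
    gaps.append(parzen_estimate(points, x_k, h) - parzen_estimate(rest, x_k, h))

print(min(gaps) > 0.0)
```

Each gap equals (1/N) times the amount by which the kernel's peak value exceeds the (N-1)-point estimate at the deleted point, so it is positive whenever the kernel has a unique maximum at its centre and the points are not all identical.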


3.  Conclusions  and  remarks 

We have shown that if the leaving-one-out method is
employed to discriminate a set of data points from one which
is identical to it, then biased results will always be obtained
if the Gaussian density estimator is used. In particular, the
leakage and false alarm rates will be 100% in this case.
If other density estimators are used, we expect the same
conclusion to hold and we motivate our expectation by
consideration of distributions whose densities are estimated
using the Parzen estimator. If noise of "small" magnitude is
added subsequently to the data, or if noise of "moderate"
magnitude is added to a small number of the components of the
data, one can see that the inequality of density estimates
proved in Section 2 is often preserved, so that the L-method
will still yield biased results. To remove this L-method
bias one should generate the two classes independently, so
that one starts with classes which are statistically, but not
identically, equivalent. Experiments with a quadratic
classifier in which this procedure was applied to simulated
Gaussian data and to non-Gaussian data from a sensitivity
study confirm that independent generation does remove the
bias of the L-method.
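
The effect described above can be reproduced in a small simulation. The sketch below is illustrative only and is not the experiment reported in this note: it assumes one-dimensional Gaussian data, a classifier that assigns a point to the class with the larger fitted Gaussian log-density (equal priors), and 200 points per class. The names `gaussian_log_density` and `loo_error_rate` are hypothetical.

```python
import math
import random


def gaussian_log_density(sample, x):
    """Log of the Gaussian density at x, with mean and 1/n variance fitted to `sample`."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((s - mean) ** 2 for s in sample) / n
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def loo_error_rate(own, other):
    """Fraction of points in `own` that leaving-one-out assigns to `other`."""
    errors = 0
    for k, x in enumerate(own):
        reduced = own[:k] + own[k + 1:]  # own class with the test point deleted
        if gaussian_log_density(other, x) > gaussian_log_density(reduced, x):
            errors += 1
    return errors / len(own)


# Assumption: both classes follow the same standard normal distribution.
rng = random.Random(2)
class_1 = [rng.gauss(0.0, 1.0) for _ in range(200)]
identical_copy = list(class_1)                               # second class identical to the first
independent_draw = [rng.gauss(0.0, 1.0) for _ in range(200)]  # generated independently

rate_identical = loo_error_rate(class_1, identical_copy)
rate_independent = loo_error_rate(class_1, independent_draw)
print(rate_identical, rate_independent)
```

With the identical copy every deleted point is assigned to the other class, in line with the 100% rates stated above, whereas the independently generated class should give an error rate well below that.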